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Abstract 

The predicted effect of effective population size on the distribution of fitness effects and substitution rate is critically dependent on the 
relationship between sequence and fitness. This highlights the importance of using models that are informed by the molecular 
biology, biochemistry, and biophysics of the evolving systems. We describe a computational model based on fundamental aspects of 
biophysics, the requirement for (most) proteins to be thermodynamically stable. Using this model, we find that differences in 
population size have minimal impact on the distribution of population-scaled fitness effects, as well as on the rate of molecular 
evolution. This is because larger populations result in selection for more stable proteins that are less affected by mutations. This 
reduction in the magnitude of the fitness effects almost exactly cancels the greater selective pressure resulting from the larger 
population size. Conversely, changes in the population size in either direction cause transient increases in the substitution rate. 
As differences in population size often correspond to changes in population size, this makes comparisons of substitution rates in 
different lineages difficult to interpret. 
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Introduction 

Novel mutations that appear in a genome can be advanta- 
geous, increasing the resulting organism's fitness, deleterious, 
reducing the fitness, or effectively neutral, having such a small 
effect on fitness that the fate of the mutation in the popula- 
tion is dominated by random drift. The relative fraction of 
these three different types of mutations, and the form of 
the overall distribution of fitness effects (s = w' — w/w, 
where w and w' are the fitness of the wild type and 
mutant, respectively) caused by such mutations, has been a 
topic of interest and debate (Bustamante 2005; Eyre-Walker 
and Keightley 2007). Characterizing this distribution is essen- 
tial for understanding the nature of genetic variation, includ- 
ing polymorphisms that may cause or influence diseases, as 
well as characterizing the evolutionary dynamics. 

Larger population sizes result in increased magnitude of 
the selective pressure acting on mutations of a given value 
of s. The fitness effect and the effective population size 
N e generally appear as a product in many evolutionary 
and genetic calculations, so often equations reference the 
population-scaled fitness effect 5 = 4A/ e s (5 = 2N e s for 



haploid organisms). For instance, Pf K (s), the probability of a 
new mutation with fitness effect s being fixed in an otherwise 
homogeneous diploid population, relative to the probability of 
fixation of a neutral mutation P F °, is given by (Fisher 1930; 
Kimura 1957, 1962; Crow and Kimura 1970) 

P Fi x(5) (i^^) ^ 4A/ e s 5 

where the approximation is valid for small s. 

The effective population size affects the substitution rate 
differently depending up the relative number of advanta- 
geous, deleterious, and neutral mutations (Gillespie 1999). If 
there are a substantial number of adaptive mutations, whose 
probability of fixation is less dependent on the population size, 
the substitution rate would be higher in larger populations 
reflecting the greater number of mutations that arise. 
Conversely, if the mutations are either neutral or so deleterious 
as to have negligible fixation probability, as suggested by the 
neutral theory of molecular evolution (Kimura 1968, 1983), 
then the substitution rate would be relatively independent of 
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population size, with the greater number of mutations cancel- 
ling the decreased probability of fixation. The nearly neutral 
theory of evolution (Ohta 1973, 1992; Kimura 1983) empha- 
sizes the role of slightly deleterious mutations, whose proba- 
bility of acceptance is smaller in larger populations. This latter 
theory predicts that smaller populations should evolve faster. 

Observations of the population dependence of the substi- 
tution rate are difficult. One approach is to examine the 
dependence of the substitution rate on the rate of recombi- 
nation. In regions of the genome with low recombination 
rates, mutations in linked genes compete for fixation (Hill 
and Robertson 1966), which has a similar effect as a lower 
effective population size. High recombination rates reduce this 
effect, so that regions of the genome that recombine rapidly 
are characterized by an increased effective population size. It 
has been observed that regions in Drosophila genome with 
high recombination rates evolve slower than regions of low 
recombination rates, consistent with the predictions of the 
nearly neutral theory (Larracuente et al. 2008; Arguello et al. 
2010; Campos et al. 2012). Other studies have looked at the 
difference in the evolution of genes on sex chromosomes, 
observing higher rates of nonsynonymous substitutions on 
the nonrecombining chromosome (Wyckoff et al. 2002; 
Berlin and Ellegren 2006). These types of analysis assume 
that the mutation process in these two different types of 
regions, as well as the properties of the encoded proteins 
(e.g., expression levels, structure, and function), are not sys- 
tematically dissimilar in ways that affect the substitution rate. 
There are, for instance, correlations between recombination 
rate, GC content, mutation rate, and rate of biased gene con- 
version that remain to be elucidated (Hardison et al. 2003; 
Duret 2006; Duret and Arndt 2008). Sex chromosomes 
might also be subject to specific adaptive selection that 
cannot be easily distinguished from reduced selection. 

A more direct approach is to examine how the substitution 
rate differs in different lineages. For instance, substitution 
rates have been compared in primates and rodents (Wu and 
Li 1985; Ohta 1995; Weinreich 2001), although such compar- 
isons are compromised by differences in, for example, gener- 
ation time, cell division rate, metabolic rate, mating behavior, 
ecological niche, and DNA repair mechanisms (Bromham et al. 
1996; Bromham 201 1). Faster evolution has been observed in 
endosymbiotic bacteria and fungi compared with their free- 
living relatives (Moran 1996; Woolfit and Bromham 2003). 
Endosymbiotic bacteria and fungi would have their effective 
population size reduced by the lower population sizes of their 
host, and would also be expected to undergo population bot- 
tlenecks when relatively few endosymbionts are transmitted 
to progeny, reducing the intra-host variation; this process is 
generally modelled as reducing the effective population size 
(Rispe and Moran 2000). The faster evolution of the endosym- 
bionts is again consistent with the nearly neutral theory, al- 
though other biological or ecological characteristics of 
endosymbionts might complicate the analysis. Comparisons 



have been made of the rates of evolution of island and main- 
land populations, with the island populations again having a 
smaller population size due to the population bottleneck that 
occurs during colonialization as well as due to habitat restric- 
tion; some studies have concluded that the smaller popula- 
tions evolve faster (Johnson and Seger 2001; Woolfit and 
Bromham 2005), although other studies have reached differ- 
ent or more nuanced conclusions (Charlesworth and Eyre- 
Walker 2007; Wright et al. 2009). 

Theoretical models of these effects are often based on 
simple models of the fitness landscape, such that the distribu- 
tion of fitness effects is constant (Ohta 1977) or that the fit- 
ness of the mutant alleles has a fixed distribution (Kingman 
1978). These models can break down if the population is far 
from a fitness optimum due to mutation-selection balance, 
where the preponderance of deleterious mutations is bal- 
anced by the greater fixation probabilities of advantageous 
mutations (Hartl et al. 1985; Cherry 1998; Wylie and 
Shakhnovich 2011; Charlesworth 2013). Mutation-selection 
balance cannot be achieved if the distribution of selective 
coefficients is independent of fitness; a stable equilibrium 
requires that there be an increased tendency toward accep- 
tance of deleterious mutations as the fitness increases (Cherry 
1998). In the case of a fitness function that plateaus as the 
fitness increases, increasing the population size would result in 
a higher equilibrium fitness, which can reduce the fitness 
impact of mutations, resulting in a narrower distribution of 
fitness effects. Under some conditions, this contraction of the 
distribution in s can exactly cancel the explicit population size 
dependence of 5, so that the distribution of population-scaled 
fitness effects [p 5 (5)], and therefore the substitution rate, is 
independent of population size, even in the nearly neutral 
model (Cherry 1998; Charlesworth 2013). 

The evolutionary process involves modifications of interact- 
ing biological macromolecules. By creating evolutionary 
models that explicitly include the properties of these evolving 
biomolecules, we can develop more realistic models of the 
evolutionary process, better understand how the evolutionary 
dynamics depends on biological context, and improve our un- 
derstandings of how the properties of these biological mole- 
cules arose. To fulfill these ambitions, we need to create 
computational models that capture the salient aspects of 
the biology while still being computationally tractable. 

It has been noted that much of the selection pressure on 
coding regions involves maintaining an adequate degree of 
thermodynamic stability for the resulting expressed proteins 
(Wang and Moult 2001; Zeldovich et al. 2007; Drummond 
and Wilke 2008; Serohijos et al. 2012). This has led to studies 
investigating how these genetic regions would evolve where 
the fitness corresponds to a simple function of stability, such 
as the fraction of proteins that would be folded at equilibrium 
(Williams et al. 2006; Chen and Shakhnovich 2009; Goldstein 
201 1 ; Wylie and Shakhnovich 201 1 ; Pollock et al. 201 2). We 
investigate the distribution of selective effects generated by 
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such a simple model, including its dependence on the effective 
population size. Our results correspond roughly to the nearly 
neutral theory. We find that the distribution of population- 
scaled fitness effect [p 5 (5)] is essentially independent of the 
effective population size, suggesting that evolutionary dynam- 
ics, at least of regions of protein-coding genes where selection 
for stability dominates, should be similarly independent of 
population size. We observe, however, a strong dependence 
of the distribution of fitness effects, as well as overall substi- 
tution rate, on fluctuations in population size. This result can 
explain why differences in substitution rates have been ob- 
served in different lineages, and indicate that the effect of 
population bottlenecks on substitution rates cannot be mod- 
eled through an appropriate modification of the effective pop- 
ulation size. 



Materials and Methods 

Protein Model 

We consider a genome containing a 900-nucleotide gene, 
coding for a 300-residue protein, as described in earlier pub- 
lications (Williams et al. 2006; Goldstein 201 1; Pollock et al. 
2012). We use a simple fitness function based on protein 
stability, where the fitness w of a protein is equal to the prob- 
ability that the protein is folded at equilibrium P Fo ided, given by 

1 

W = Ppolded = 1+eAG /*7- (2) 

where AG is the difference in free energy between the folded 
and unfolded states, T is the temperature, and k is 
Boltzmann's constant. (Note that more negative values of 
AG correspond to higher stability.) 

The free energy G(5, Q) of a protein with sequence 
5 = {a<\,a 2 ,a 3 . . . a 30 o} in any given conformation Q is com- 
puted by summing the contact energies of all of the pairs of 
residues which are in contact in that conformation, where we 
use the contact energies determined by Miyazawa and 
Jernigan (1985) based on frequencies of contacts in known 
protein structures; residues are in contact if their Cp atoms (Q 
for glycine) are closer than 7 A to each other. We consider the 
native state of the protein to be the conformation of the 
purple acid phosphatase (PDB 1QHW; Lindqvist et al. 1999); 
the free energy for a given sequence in this particular native 
state is designated G NS (5). We assume that the distribution of 
free energies for the large ensemble of A/y unfolded states can 
be represented as a Gaussian distribution with mean G(5) and 
variance <j(5) 2 . We estimate G(5) and a(5) 2 by calculating the 
free energy of the sequence in a set of 55 alternative struc- 
tures. The free energy difference between the folded and 
unfolded states is then given by 

AG(S) = G N s(S) + ff(5)2 - 2 f^ (5 W ln/Vu (3) 



A/y is settolO 160 . T is set to 20°C.The probability of folding, 
and thus the fitness, is then calculated using equation (2). 

Evolutionary Model 

We initialize a nucleic acid sequence to a set of 300 random 
codons (excluding stop codons). The codons are translated 
into a protein sequence using the standard genetic code, 
and the free energy of folding (and organismal fitness) calcu- 
lated as described earlier. We simulate evolutionary dynamics 
where we assume that the mutation rate is slow relative to the 
fixation time, so that population variation can be ignored. 
We calculate the rate of all 3 x 900 possible single nucleotide 
substitutions, equal to the rate of mutation (using a K80 nu- 
cleotide substitution model [Kimura 1980] with a transition- 
transversion ratio of 2.0) times the probability of fixation of the 
mutation, calculated by computing the free energy of folding 
of the mutant and using equation (1). (Mutations resulting in 
stop codons are considered lethal.) We estimate the time 
to the next substitution by drawing from an exponential 
distribution with decay rate equal to the sum of all of the 
individual substitution rates, and choose a mutation to 
accept with probability proportional to its substitution rate. 
The protein sequence evolves with increasing stability 
(decreasing AG) until the point of mutation-selection bal- 
ance, where there is no further long-term change of stability. 
The simulation is then extended to an evolutionary interval of 
10 nucleotide substitutions expected per nucleotide location 
for neutral substitutions. Only the data subsequent to the 
establishment of mutation-selection balance are used in the 
subsequent analysis. These simulations are repeated 1 00 times 
with A/ e = 1 0 4 , A/ e = 1 0 6 , and N e = 1 0 8 

At each time point of the simulation, we calculate the 
effect of every possible single nucleotide mutation, and use 
all of these mutations to calculate the distribution of popula- 
tion-scaled fitness effects [p 5 (5)] as well as the instantaneous 
substitution rate, represented by the ratio of nonsynonymous 
to synonymous substitution rates (dA//d5). 

Explorations of Alternative Models 

How sensitive are the results to a particular model? Two as- 
pects of this model might be particularly relevant: 1) the spe- 
cific relationship between fitness and stability; 2) the epistasis 
between various locations in the protein in calculating the 
fitness. We examine these aspects sequentially. 

The relationship between protein stability and organismal 
fitness is still unclear and is possibly complicated (Bershtein 
et al. 2012). In particular, there are indications that avoiding 
aggregation may be more important than the concentration 
of the folded state (Chen and Dokholyan 2008; Zhang et al. 
2008; Johnson and Hummer 2011; Levy et al. 2012; Yang 
et al. 2012). A linear relationship has been observed between 
fitness cost and fraction of aggregated proteins (Geiler- 
Samerotte et al. 2011); such an effect would not greatly 
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change the model discussed earlier, as it still results in a linear 
relationship between fraction of folded protein and fitness, 
albeit with a different slope. To see how sensitive the results 
are to variations in the fitness function, we consider a different 
fitness function representing a fitness penalty for aggregation, 
which we model using a quadratic function of the amount of 
unfolded protein 

w = 1 - £(1 - P Fo , ded ) 2 (4) 

where § = 3x10 3 is chosen so that dw/dAG at 
AG =9kcal/mol is roughly similar to that in the original 
model represented by equation (2). Thirty simulations are 
made for each of the three values of N e using this fitness 
function. 

Removing Epistasis 

The contribution of every residue to the fitness depends on 
the amino acids at every other location in the protein. This is 
due to two different aspects of the model. First, the energetics 
are based on contact potentials, which are a function of pairs 
of amino acids which are in contact in the native or alternative 
structures. Second, the fitness is a nonlinear function of the 
free energy of folding, as indicated by equation (2). In partic- 
ular, an amino acid substitution at one location in the protein 
will cause a change in the protein's stability, but the effect of 
this change on the protein fitness will depend on the prior 
stability, which depends on the amino acids found in all other 
locations. This can be seen if we represent s as a function of 
the initial stability and change in stability (Wylie and 
Shakhnovich 2011) 



1 



1 



s = 



1 +e (AG+AAG)//rr -| +e AG/kT 

1 ; 
1 +Q AG/kT 



, e AG/kT^ _ Q AAG/kTy ^ 



To remove these sources of epistasis, we construct a model 
where the fitness contribution of every amino acid at each 
location is fixed. We first compute the fitness i/v Ref of the ref- 
erence sequence. We then calculate the change in fitness for 
every single amino acid change, to generate Aw/(a') = 
w({ai,a2, . . .3/, • • .3300}) - WRef, where w({a<\,a2, . . . 
aj, ... a3oo}) is the fitness of an amino acid sequence differing 
from the reference sequence by the single replacement of a] 
for dj. The fitness of any arbitrary sequence is represented as 

w({a\,a' 2 , ...a;, ...a 300 }) = Aw/(a;)+Aw Ref . (6) 

/ 

The fitness effect s/(a' a") corresponding to a mutation 
from amino acid a' to a" at position / is given by 

c ^ *"\ Aw/(a") - Aw/(aQ _ „ , 

s/(a -> a ) = — — « Aw/(a ) - Aw,(a ), 

wi(a') 

(7) 

where we have taken advantage of the fact that the fitnesses 
during the simulation are all close to unity. 



For calculating the resulting distribution of population- 
scaled fitness effects, we take advantage of the simplicity of 
the model to calculate the distributions by summing over all 
possible mutations from all possible codons at each location, 
weighted by the equilibrium probability of the original codon 
and the mutation, using the approaches described in Tamuri 
et al. (2012). We averaged over the results obtained with 10 
different reference sequences. 

Variations in Effective Population Size: Bottlenecks 

We also perform simulations with the original model (eq. 2) 
where the effective population size fluctuates between 10 6 
and 10 4 , representing periodic population bottlenecks, with 
equal amounts of evolutionary time spent at each population 
level. The period of the oscillations in units of evolutionary time 
vary between 0.001 and 1.0 expected neutral substitutions 
per location, with 10 simulations performed for each period. 

Results 

After an initial period, the free energy of folding of the pro- 
teins reached values of approximately AG — 7 for N e = 10 4 
to -12 kcalmor 1 for A/ e = 10 8 . This degree of stability is 
roughly similar to that observed in real proteins. It is important 
to note that this stability is far from optimum; we can use hill- 
climbing algorithms to find sequences with stabilities in the 
order of -1 18kcal/mol (Goldstein 201 1). This marginal stabil- 
ity is also observed with natural proteins, which can be mod- 
ified to have higher stabilities while retaining native-like 
activities (Serrano et al. 1993; Giver et al. 1998; Van den 
Burg et al. 1998; Zhao and Arnold 1999; Korkegian et al. 
2005). The stability of these modelled proteins represents mu- 
tation-selection balance, where the greater number of 
destabilizing mutations is compensated by the higher accep- 
tance rate for stabilizing mutations (Goldstein 201 1). 

Figure 1 shows the distribution of population-scaled fitness 
effects of nonsynonymous mutations for three different pop- 
ulation sizes varying over four orders of magnitude. As can be 
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Fig. 1. — Distribution of population-scaled fitness effects for nonsyn- 
onymous mutations when fitness is proportional to the fraction of proteins 
folded at equilibrium, calculated using equation (2), for N e = 10 4 (green), 
N e = 10 6 (blue), and N e = 10 s (red), on linear (A) and log (B) scales. 
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seen, the distributions are extremely similar. All of the distri- 
butions of deleterious mutations are strongly leptokurtic, fit- 
ting an inverted Gamma distribution with shape parameter 
a = 0.08. Approximately 25% of mutations are effectively 
neutral (-1 < 5 < 1), approximately 25% are mildly delete- 
rious (-10<5<-1), and 50% are strongly deleterious 
(5<-10). Although the distribution corresponds to the 
near-neutral theory, the substitution rate is nearly indepen- 
dent of population size, in agreement with the predictions 
of Cherry (1 998): the average ratio of nonsynonymous to syn- 
onymous substitution rates ((d/V/d5)) only changes minimally, 
from 0.350 for N e = 1 0 4 to 0.338 for N e = 1 0 8 

Explorations of Alternative Models 

To see how the results vary with the fitness function, we 
perform simulations using an alternative model based on 
aggregation, as represented by equation (4). The resulting 
distribution of population-scaled fitness effects (fig. 2A) is 
similar to the earlier model, with a rate of evolutionary 
change essentially independent of effective population size, 
with (d/V/d5) changing from 0.327 for N e = 10 4 to 0.323 for 
A/ e = 10 8 

Removing epistasis through the use of the model repre- 
sented by equation (7) results in a strong dependence of the 
distribution of 5 on the population size, as shown in figure 2B. 
There is also an extremely strong dependence of the substitu- 
tion rate with effective population size, with (dA//d5) chang- 
ing from 0.725 for N e = 10 4 to 0.357 for N e = 10 6 to 0.027 
for A/ e = 10 8 . 

Variations in Effective Population Size: Bottlenecks 

In contrast to differences in effective population size, there can 
also be fluctuations in effective population size. We perform 
simulations where the effective population size alternates be- 
tween N e = 1 0 6 and N e = 1 0 4 , as illustrated in the bottom of 



figure 3A, representing repetitive population bottlenecks. The 
resulting distribution of population-scaled fitness effects, for 
various timescales of population changes, is shown in 
figure 2C. With faster changes in population, there is a sub- 
stantial increase in the number of advantageous mutations. 
The dependence of the rate of evolution on the period of the 
population changes is shown in figure 3B. With faster and 
faster changes, there is approximately a doubling of the aver- 
age value of dA//d5. The instantaneous value of 6N/65 for a 
period of fluctuation of 0.1, averaged over 1,000 cycles, is 
shown in figure 3A. Both increases and decreases in effective 
population sizes cause increases in the instantaneous value of 
d/V/d5; when the population size increases, there is increased 
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Fig. 3. — {A) Instantaneous values of d/V/d5 during fluctuations in ef- 
fective population size, when the fluctuations have a period of 0.1. Both 
increases and decreases in population size cause transient increases in the 
rate of evolution. Periods represent durations in evolutionary time corre- 
sponding to expected number of base substitutions per nucleotide position 
under conditions of neutral evolution. (B) Averaged values of (d/V/dS) for 
fluctuating population sizes, as a function of the period of the fluctuations 
(blue). The value of (dA//d5) when the population is fixed at N e = 10 6 is 
shown for comparison (red). 
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Fig. 2. — (A) Distribution of population-scaled fitness effects for nonsynonymous mutations using a fitness model penalizing unfolded protein, based on 
equation (4), for N e = 10 4 (green), N e = '\0 6 (blue), and N e = 10 8 (red). (B) Distribution of population-scaled fitness effects for a model where epistasis has 
been removed, based on equation (7), with the same color scheme as (A). (Q Distribution of population-scaled fitness effects for effective population size 
oscillating between 10 4 and 10 6 , with the fitness calculated using equation (2), for various periods of the fluctuation: fixed at N e = 10 6 (blue), period = 0.1 
(red), 0.01 (green), and 0.001 (cyan). Periods represent durations in evolutionary time corresponding to expected number of base substitutions per nucleotide 
position under conditions of neutral evolution. 
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selection for greater stability, resulting in an increase in the 
number of stabilizing (adaptive) substitutions, while decreases 
in the population size results in a decrease in selective con- 
straints, resulting in increased acceptance of slightly 
destabilizing substitutions. 

Discussion 

Using a simple but reasonable model of protein thermody- 
namics to provide a fitness function, we find that the distri- 
bution of the population-scaled fitness effects and the 
substitution rate are remarkably unaffected by the effective 
population size. In contrast to the small dependence of these 
evolutionary parameters on N e , we find a strong effect from 
time varying effective population sizes. There is a large tran- 
sient increase in the number of adaptive substitutions when 
the population size increases, as the protein adapts to the 
greater degree of selective pressure; there is also a transient 
increase in the number of slightly deleterious substitutions 
when the population size decreases, as the selective pressure 
relaxes and the protein evolves to lower stabilities. This effect 
depends on the timescale of the population fluctuations, but is 
significant over a wide range. 

Why the Lack of Dependence on Effective Population 
Size? 

When a mutation occurs, the values of s and 5 corresponding 
to a given value of AAG is approximately given by 



dw 
d(AG) 



AAG 



(aki 6W \ 



(8) 



AAG 



where we have assumed that the fitness of the wild type is 
close to unity (true of these simulations) and that the magni- 
tude of AG is sufficiently small that a Taylor expansion is jus- 
tified. (This is also not a bad assumption, as mutations with 
very large destabilizing effects will be evolutionarily unimpor- 
tant, and mutations with very large stabilizing effects are ex- 
tremely rare.) (A more exact but less general calculation can be 
performed starting with eq. 5.) For any given values of N e and 
di/v/d(AG), 5 is proportional to AAG, so that distribution of 
population-scaled selective effects, p 5 (5), is then a stretched 
version of p AAG (AAG) given by 



Ps(S) : 



1 

M pAAG v; 



where X is given by 



X = 4A/p 



dw 
d(AG) " 



(9) 



(10) 



The stability is based on a large number of stabilizing and 
destabilizing interactions, involving residues throughout the 
protein. The magnitude of these interactions is on the same 



scale as the total free energy of folding, so that significant 
changes in AG can be caused by modifying only a few of 
these interactions. As a result the distribution of changes of 
free energy of folding, p AAG (A AG), is relatively unaffected by 
the stability AG, as long as the protein is not excessively (i.e., 
unrealistically) stable (Goldstein 2011), a conclusion that has 
been verified both by other simulations and experimental 
measurements (Bloom et al. 2005, 2006, 2007; Tokuriki 
et al. 2007). 

Although p AAG (A AG) is independent AG, the slope of the 
fitness function of equation (2) 



6w 



^AG/kT 



d(AG) ^ r ( 1+e AG//c7) 2 



(11) 



will be dependent on the protein stability, becoming closer to 
zero as the protein stability increases, so that X is dependent 
on AG as well as A/ e . 

As a protein evolves toward higher stability, the distribution 
of AAG is constant but the selective pressure relaxes until the 
expected change in fitness, or alternatively the equilibrium 
average value of 5 for accepted mutations, is approximately 
zero. We can describe this equilibrium condition as 



(S) 



Fixed 



j S P Fix (S) p s (S) d5 « 0 



1 



fsP FK (S) Paag(J) dS 



(12) 



where (5) Fixed represents the value of 5 averaged over fixed 
substitutions, and P F j X (5) is the fixation probability, which we 
are assuming, as in equation (1), is only a function of 5. Note 
that, as long as p AAG (AAG) is fixed, the only adjustable pa- 
rameter in equation (12) is X. There will be a certain value of 
X = A Eq where equation (12) is satisfied. (For the current 
model, this value is approximately A Eq = -0.766.) More neg- 
ative values result in a positive (5) Fixed , moving the system to a 
flatter region of the fitness curve, making X less negative, 
while less negative values result in a positive (5) Fixed . The 
result is that the free energy of folding will change, modifying 
dw/d(AG) until X = A Eq , so that the change in the slope of 
the fitness landscape cancels the effect of the changing pop- 
ulation size. At this equilibrium, the distribution of population- 
scaled fitness effects will be given by equation (9) with 
X = A Eq . The resulting distribution of 5 will be dependent on 
the value of A Eq , which will depend on the forms of 
p AAG (AAG) and P F j X (5), but not on the value of A/ e . 

The generality of this argument indicates that this observa- 
tion should not be dependent on a specific fitness function. 
What is required is for the fitness to be a concave function 
of some parameter and that this parameter is what Cherry 
terms an equimutable parameter, where the distribution of 
changes of this parameter with mutations is independent of its 
current value (Cherry 1998); for the current model, this is 
satisfied by the observed independence of p AAG (AAG) 
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on AG. It is likely that the characteristics of protein stability 
that provide for equimutability — that stability is a composite 
function that depends on contributions from many appro- 
priately sized terms, where the rapidly declining number of 
increasingly stable sequences means the stability is far from 
optimal — is common in biology. We use fraction of proteins 
folded (eq. 2) as our fitness function, but alternative formula- 
tions (avoiding of self aggregation, eq. 4) give similar results. 
The calculation of the fitness is highly epistatic, where the 
contribution of each amino acid to the fitness depends on 
the rest of the protein sequence. This epistasis is required 
for this population size independence. When the epistasis is 
removed and the fitness becomes the sum of a large number 
of contributions from simple states, the fitness function ceases 
to be a concave function of a composite property, and a 
strong population size dependence results, as shown in 
figure 2B. 

The results presented here indicate that changes in popu- 
lation size affect evolutionary dynamics quite differently from 
differences in population size, as has been suggested by 
Charlesworth and Eyre-Walker (2007) and Cherry (1998). In 
particular, they noted that large increases in population size 
can cause adaptive bursts that overcome the decrease in sub- 
stitution rate due to the stronger selective constraints. The 
work presented here also emphasizes the role of population 
changes, except in our model it is only the changes in popu- 
lation size that cause significant changes in the substitution 
rate. As a result, population increases of arbitrary size will 
cause increases in the substitution rate, as there is no decrease 
in the substitution rate to be overcome. Similarly, decreases in 
the population size will cause increases in the substitution rate, 
but this will only be a transient effect resulting from the 
change in population size, rather than the difference in pop- 
ulation size. 

How Does the Substitution Rate Depend on the 
Distribution of Mutational Effects? 

In addition to being largely independence of population size, 
the substitution rate will also be independent of the magni- 
tude of the effect of mutations on the protein stability; that is, 
scaling all of the values of AAG by a constant factor a will 
result in a change the stability of the protein so as to scale 
dw/d(AG) by resulting in the same distribution of p s (S) 
and thus the same substitution rate. The substitution rate, 
however, is dependent on the shape of p AAG (AAG) To ex- 
plore this dependence, we constructed a simpler model where 
a fraction p_ of all mutations is destabilizing (with change in 
free energy of folding AAG_), a fraction p 0 is neutral 
(AAG 0 = 0), and a fraction p+ is stabilizing (with change in 
free energy of folding AAG+). The protein stability AG is 
adjusted until equation (1 2) is satisfied, and the relative fitness 
of the three different types of mutants and the corresponding 
acceptance rates calculated using equation (1). Figure 44 
shows the dependence of the substitution rate (d/V/d5) 




0.2 0.4 0.6 0.8 

P- 



Fig. 4. — (A) Effect on d/V/d5 of changing the fraction of mutations 
that are deleterious (p_), computed using a simple model, where the 
effect of the mutation on the free energy of folding is equal to 
AAG_ = 1 kcal mol -1 (blue), AAG_ = 2 kcal mol -1 (red), 
AAG_ = 3 kcal mol -1 (green), AAG_ = 4 kcal mol -1 (purple), and 
AAG_ = 5 kcal mol -1 (orange). Other parameters as defined in the 
text. (B) Effect on d/V/d5 of changing the fraction of mutations that are 
advantageous (p+), computed using a simple model, where the effect of 
the mutation on the free energy of folding is equal to 
AAG+ = -2 kcal mol -1 (blue), AAG+ = -1.5 kcal mol -1 (red), 
AAG+ = -1 kcal mol -1 (green), AAG+ = -0.5 kcal mol -1 (purple), 
and AAG+ = -0.1 kcalmol -1 (orange). Other parameters as defined in 
the text. 



on the fraction p_ and effect AAG_ of the destabilizing 
mutations (p+ = 0.05, AAG+ = -0.5 kcalmol -1 , p 0 = 1- 
p+ - p_). As shown, the rate is relatively insensitive to the 
magnitude of the destabilization, but extremely sensitive to 
the relative fraction. Increasing AAG_ results in a correspond- 
ing stabilization of the protein, resulting in a movement to the 
flatter part of the fitness curve, reducing the impact of these 
deleterious mutations on the fitness. Compensation resulting 
from changes in p_ are much weaker. Modifying the fraction 
and effect of stabilizing mutations shows a smaller effect, with 
the substitution rate increasing both with the fraction and 
magnitude of the stabilization, as shown in figure 4B 
(p_ = 0.3, AAG_ = 2.0 kcal mol" 1 , p 0 = 1 - p+ - p_). 

Comparison with Experimental Observations 

The results presented here seem in contradiction with the 
observations described in the Introduction, where faster sub- 
stitution rates are observed in 1) regions of the genome un- 
dergoing slow recombination compared with regions 
undergoing fast recombination, 2) endosymbionts compared 
with their free living relatives, and 3) island populations com- 
pared with mainland populations. In each of these three cases, 
there seems to be faster evolution in smaller populations, as 
would be predicted based on the nearly neutral model. 

As remarked earlier, there are many complicated issues in 
these comparisons, so that it is difficult to conclude that the 
only relevant differences between these two sets are differ- 
ences in effective population size. We also note that the dif- 
ferences in substitution rates observed in these comparisons 
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are generally quite modest. For instance, Woolfit and 
Bromham (2005) observed a median increase in 6N/65 of 
only 20% in island populations compared with mainland pop- 
ulations, with no significant increase in overall substitution 
rate. Campos et al. (2012) observed that autosomal genes 
in nonrecombining regions in Drosophila had a 6N/6S ratio 
of approximately 45% higher than similar genes in recombin- 
ing regions of the genome. Although it is difficult to make 
quantitative comparisons with the simple models presented 
here, and it is difficult to estimate differences in effective pop- 
ulation sizes (Gossmann et al. 2011, 2012), position-specific 
measures of fitness, as represented by equation (7), result in 
d/V/d5 increasing by a over factor of over 26 as the population 
size is reduced from 1 0 8 to 1 0 4 , which suggests that it may be 
the weakness of effect of population size on substitution rate 
that requires an explanation. 

Additionally, these comparisons often interpret changes 
in effective population size as equivalent to differences in ef- 
fective population size. Island populations undergo severe 
population bottlenecks, and bottlenecks are generally consid- 
ered to reduce the effective population size. The analysis pre- 
sented here suggests that population bottlenecks affect 
evolutionary dynamics quite differently from constant differ- 
ences in population size, and it might be the population bot- 
tlenecks, with the resulting decrease and increase in selective 
constraints, that are affecting the substitution rate, while a 
static lower effective population size would have no such 
effect. Charlesworth and Eyre-Walker (2007), for instance, 
observed that differences in substitution rate between island 
and mainland populations depend upon whether a mainland 
population colonized an island (population size decrease in the 
island population) or an island population colonized a main- 
land (population size increase in the mainland population). 
Significantly, in the latter case, the mainland population gen- 
erally had a higher rate of evolution than the island popula- 
tion, as would be predicted by the model presented here. 
This indicates that comparisons between the evolution of 
different lineages should be interpreted with care, as it 
would be difficult to disentangle the very different ways 
that static population size differences and population size fluc- 
tuations contribute to substitution rates. This is an inherent 
problem with this type of comparisons, as related lineages 
with different effective population sizes must have experi- 
enced the changes in population size that caused these 
differences. 

A similar argument can be made comparing the effect of 
recombination rates on effective population size. Competition 
between mutations occurring at different points on a genet- 
ically linked region of the genome may correspond to reduced 
effective population sizes, but these mutations would occur 
sporadically. In this case, there would be temporal fluctuations 
in this effective population size, as other mutations with dif- 
ferent fitness effects occur in nearby genes. In this way, lack 
of recombination would result in variations in effective 



population size, increasing the rate of evolutionary change, 
as has been observed experimentally (Larracuente et al. 2008; 
Arguello et al. 2010; Campos et al. 2012). Regions of low or 
no recombination would also be more subject to selective 
sweeps, providing a further mechanism for rapid changes in 
effective population size. Again, as with lineage-specific sub- 
stitution rates, it is difficult to disentangle differences from 
fluctuations in effective population sizes. 

Why Do Some Proteins Evolve Faster than Others? 

As pointed out by Cherry (1998), with the exception of con- 
spicuous outliers, differences in the substitution rates in differ- 
ent proteins is surprisingly modest, varying by approximately 
an order of magnitude (Grishin et al. 2000). Differences in the 
mutation rate in different parts of the genome would contrib- 
ute to this rate variation, as would differences in the number 
of sites under nonthermodynamic constraints such as the re- 
quirements of functionality (Zuckerkandl 1976). In addition, as 
described earlier, the substitution rate is strongly dependent 
on p AAG (AAG), in particular on the fraction of destabilizing 
mutations. It is likely that this distribution is dependent on the 
size, structure, composition, and environment of the protein, 
leading to variation in the substitution rates. Finally, there has 
been significant interest in proteins that are unfolded under 
physiological conditions, or contain significant unstructured 
regions (Wright and Dyson 1999; Dunker et al. 2008). The 
selective constraints on these proteins and regions are still 
poorly understood. 

Limitations of the Model 

Evolutionary dynamics are dependent on the mapping be- 
tween genotype, phenotype, and fitness. In this article, we 
analyze a model of protein thermodynamics that provide a 
reasonable mapping between these quantities. It is known 
that achieving sufficient thermodynamic stability is an impor- 
tant selective constraint for many proteins (Wang and Moult 
2001; Zeldovich et al. 2007; Drummond and Wilke 2008; 
Serohijos et al. 2012). Although our model of thermodynam- 
ics is, by necessity, simplistic, it does include many realistic 
aspects, such as the need for considering differences between 
the free energy of the native state and a large ensemble of 
alternative states and the stability being a holistic function of 
many epistatic energetic interactions. Importantly, it reprodu- 
ces many known properties of proteins and their evolution, 
such as the observed marginal stability, the higher rate of 
evolution of exposed locations compared with buried loca- 
tions, the tendency for hydrophobic residues to cluster in 
the interior, the dependence of protein stability on population 
size, and over-dispersion of the molecular clock (Goldstein 
2011). 

Our measure of organismal fitness, the fraction of proteins 
folded at equilibrium, is certainly overly simple. There are likely 
to be specific requirements at particular locations in the 



Genome Biol. Evol. 5(9): 1 584-1 593. doi:10.1093/gbe/evt1 10 Advance Access publication July 24, 2013 



1591 



Goldstein 



GBE 



protein, necessary for achieving functionality. These require- 
ments on the protein sequence will, however, likely be suffi- 
ciently rigid so that changes in these locations would 
contribute minimally to the evolutionary dynamics. Other 
properties, such as resistance to aggregation, are also likely 
significant (Chen and Dokholyan 2008; Zhang et al. 2008; 
Johnson and Hummer 2011; Levy et al. 2012; Yang et al. 
2012). Using a different fitness function based on avoiding 
aggregation, as represented by equation (4), did not signifi- 
cantly change the results. As long as the fitness function is a 
concave function of free energy of folding, increasing the 
population size will move the protein to a higher, and corre- 
spondingly flatter, region of the fitness landscape, and will 
result in the population-independent substitution rates de- 
scribed here. This is likely true even if the fitness is a concave 
function of some other quantity besides protein stability (such 
as saturation kinetics in biochemical reactions; Hartl et al. 
1985), as long as this quantity is an aggregate quantity de- 
pendent on overall properties of the protein sequence that 
fulfils the equimutability criterion. 

An important caveat of this analysis is the assumption of a 
slow mutation rate, so that the time for fixation (or elimina- 
tion) is short relative to the length of evolutionary time be- 
tween mutations. Genetic variation in the population would 
affect the substitution rate and would also be dependent 
on the population size. Wylie and Shakhnovich (2011), for 
instance, have observed in a simple model that the distribution 
of fitness effects depends on the mutation rate, indicating that 
the presence of multiple mutations in the population has an 
effect. Similar complications can emerge if the timescale for 
fluctuations in population size become comparable with or 
shorter than the fixation time (Otto and Whitlock 1997). 
Neglecting this effect is a limitation of this work. 
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